Abstract

The aim of this paper is to separate handwritten and
printed text in real documents that contain noise and
graphics, including annotations. Relying on the run-length
smoothing algorithm (RLSA), the extracted pseudo-lines
and pseudo-words are used as basic blocks for
classification. A multi-class support vector machine (SVM)
with a Gaussian kernel performs a first labelling of each
pseudo-word, and the local neighbourhood is then studied:
context is propagated between neighbours so that possible
labelling errors can be corrected. To address running
time, we propose methods of linear complexity based on
k-NN with constraints. When a kd-tree is used, the running
time is almost linearly proportional to the number of
pseudo-words. The performance of our system is close to
90%, even when very small learning datasets are used, with
samples composed essentially of complex administrative
documents.

1 Introduction 

Within the scope of document analysis and processing,
this paper is motivated by the need to separate
handwritten and machine-printed text (H&V) so that
further processing, such as document information
exploitation and retrieval, becomes feasible. In other
words, such a separation is an important step in the
process because it allows retro-conversion to avoid heavy
treatments and errors when transcribing the content.

Considering a continuous flow of administrative documents
into our system, we face a variety of document types,
contents, qualities and structures. Fundamentally,
documents can be skewed, noisy and sometimes overlapped
with graphics, i.e., lines and unconstrained annotations.
In this context, most of the image samples need to be
properly treated. Without integrating such tools, our
system, in this framework, aims to extract the annotations
whatever the language used in the document (French, German
or English), the content (typed or handwritten) and the
document structure: structured (e.g. tables),
semi-structured (e.g. forms) or structure-free. Although
the segmentation topic has been studied for several years
[1] and different methods have been proposed to solve
particular aspects of the separation [2,3], heterogeneous
document separation still remains an open problem. Another
strong industrial constraint is to reduce running time so
that the system can maintain its processing speed. In
addition, parameter-free methods are preferable since they
can be applied more generally. In this paper, we are
motivated by the work of Kandan et al. [4], where the
separation is made into two classes by using descriptors
that are insensitive to translation, rotation and scaling.
Classifications using SVM and k-NN are first investigated,
and a re-classification step is then performed using a
Delaunay triangulation. Zheng et al. proposed two
segmentation approaches and evaluated them over noisy
documents [5]. The first one is used to determine the most
appropriate segmentation, where a comparison is made
between the segmentation into words, lines and connected
components. The second one deals with word classification
by selecting 31 descriptors out of a hundred. They also
introduce information about class in order to take the
noise into account. A Fisher classifier is used to label
the segmented blocks and a Markov field then allows fine
classification, considering the contextual information of
each word.

Figure 1. Workflow showing several consecutive stages,
from pre-processing to output, i.e., H&V text separation.

The rest of this paper is organised as follows. We detail
our proposed approach in Section 2. It mainly includes
pre-processing, pseudo-word segmentation, word model
training, word classification and pseudo-word grouping.
Full experimental results and their analysis are reported
in Section 3. The paper is concluded in Section 4,
including a few perspectives.

2 The proposed approach 

As illustrated in Fig. 1, our proposed approach con- 
sists of several consecutive steps. It includes pre- 
processing, pseudo-word segmentation, word model 
training, word classification and context propagation. 
In what follows, we explain them, one after another. 

Preprocessing. Low-quality documents require significant
pre-processing. Our pre-processing is composed of the
following steps:

1. edge removal by using a rule system based on the
shape and position of the connected components
(CC);

2. noise filtering by using a modified kfill [6];

3. slope detection by using the RAST method [7];
and

4. filtering by using the modified kfill on the
deskewed document.

Figure 2. An example showing pre-processing: (a) input
sample and (b) its corresponding output.

Pseudo-word segmentation. In this section, we create
regular and stable areas that will be used to label the
H&V zones in the document image. To handle this, we use
a double RLSA, as presented in Algorithm 1 (segmentation
by double smearing), which aims to provide a fine word
segmentation.

Figure 3. Segmentation comparison: (a) classical RLSA and
(b) double RLSA. Extracted pseudo-words are framed and
lines are identified by the color of the pseudo-words.
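As a rough illustration of the smearing operation that RLSA relies on, the following Python sketch fills short background runs that lie between ink pixels along each row. The function names, the 0/1 pixel encoding and the inclusive threshold are assumptions made for this example; the paper does not describe its implementation at this level.

```python
def rlsa_row(row, threshold):
    """Horizontal run-length smoothing of one binary row (1 = ink).

    Background runs enclosed by two ink pixels and no longer than
    `threshold` are filled with ink. Leading and trailing background
    is left untouched.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel
    for i, pixel in enumerate(row):
        if pixel:
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply the row-wise smearing to every row of a binary image."""
    return [rlsa_row(row, threshold) for row in image]
```

With a small threshold only the characters of a word merge; with a larger one whole lines merge, which is why the choice of threshold matters for the segmentation.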

In each of the extracted lines, smearing is performed
first and the distances between the bounding boxes of
adjacent CCs are then calculated. This allows us to
construct a histogram of these distances, which generally
contains two dominant peaks:

1. the first peak corresponds to the most frequent gap
between CCs, which can be considered as the distance
between characters of the same word; and

2. the second peak corresponds to the most frequent
gap between words belonging to the same row.

The first peak thus gives the distance between the letters
within a word, while the second peak determines the
threshold to be used in pseudo-word segmentation. We can
therefore apply a second smearing that allows a finer
segmentation: because handwritten and printed words do
not respect the same distances between letters and words,
we are able to adapt the segmentation to the content of
each row. Fig. 3 illustrates the comparison between the
original and the double RLSA. In this illustration, it is
important to notice that words are well segmented when the
double RLSA is used, in contrast to the text blocks (which
sometimes contain several words) produced by the classical
RLSA.
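The search for the threshold between the two peaks of the gap histogram can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, gaps are integer pixel distances, bins are merged in pairs starting at 2 pixels, and the returned value converts the merged-bin index back to pixels by taking the lower bound of the bin, which is one possible reading of the final step.

```python
def bincount(values):
    """Count occurrences of each non-negative integer in `values`."""
    counts = [0] * (max(values) + 1) if values else []
    for v in values:
        counts[v] += 1
    return counts


def word_gap_threshold(gaps):
    """Estimate the pseudo-word gap threshold for one text line.

    `gaps` holds integer pixel distances between adjacent connected
    components. Returns None when no gap of 2 pixels or more exists.
    """
    counts = bincount(gaps)
    tail = counts[2:]            # ignore gaps of 0 and 1 pixel
    if len(tail) % 2:
        tail.append(0)           # pad so that bins can be merged in pairs
    # histo[j] counts the gaps of 2j + 2 and 2j + 3 pixels
    histo = [tail[j] + tail[j + 1] for j in range(0, len(tail), 2)]
    if not histo:
        return None
    # Highest peak, expected to be the inter-character gap.
    i = max(range(len(histo)), key=histo.__getitem__)
    # Walk right until the histogram rises again (end of the valley).
    # If it never rises, the last bin is used.
    while i + 1 < len(histo):
        previous = histo[i]
        i += 1
        if histo[i] > previous:
            break
    # Assumption: lower bound of the merged bin, expressed in pixels.
    return 2 * i + 2


gaps = [2, 2, 2, 3, 3, 2, 3, 4, 10, 10, 11, 10, 11, 12]
print(word_gap_threshold(gaps))  # 10
```

In this toy line, gaps of 2 to 3 pixels separate letters and gaps of 10 pixels or more separate words, so the second smearing would use a threshold of about 10 pixels.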

Algorithm 1 Segmentation by double smearing

lines ← smearing(image)
for all line L in lines do
    list_distances ← ∅
    for all CC c in line L do
        d_min ← distmin(list_CC, c)
        list_distances ← add(list_distances, d_min)
    end for
    compt ← bincount(list_distances)
    histo ← compt[2 :: 2] + compt[3 :: 2]
    i ← argmax(histo)
    repeat
        previous ← histo[i]
        i ← i + 1
    until histo[i] > previous
    d_hs ← i + 2
end for

Word model training. As said before, we need reliable
models to separate H&V information. To obtain these
models, we perform learning for the two classes from
samples, taking representative words. We then select
several specific descriptors belonging to four different
categories:

1. morphological (local properties of pseudo-words
such as height, width and pixel number);

2. CC descriptors (11 descriptors as proposed in [5]);

3. pixel repartition (global descriptors like invariant
Hu moments and variance of the projection profiles
[4,8]); and

4. other local properties such as run length, crossing
count and bi-level co-occurrences, as described
in [5].
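To make the descriptor categories more concrete, the following Python sketch computes a small, illustrative subset on a binary pseudo-word image: size, pixel count, variance of the projection profiles and a simple crossing count. The function names and the exact definitions (for example, counting background-to-ink transitions along rows) are assumptions for this example and are not taken from [4], [5] or [8].

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def basic_descriptors(block):
    """Compute a few descriptors of a binary block (list of rows, 1 = ink)."""
    h_profile = [sum(row) for row in block]        # ink pixels per row
    v_profile = [sum(col) for col in zip(*block)]  # ink pixels per column
    # Background-to-ink transitions along each row.
    crossings = sum(
        1
        for row in block
        for a, b in zip(row, row[1:])
        if a == 0 and b == 1
    )
    return {
        "height": len(block),
        "width": len(block[0]),
        "pixel_count": sum(h_profile),
        "h_profile_variance": variance(h_profile),
        "v_profile_variance": variance(v_profile),
        "crossing_count": crossings,
    }


block = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
features = basic_descriptors(block)
```

Each of these quantities is obtained in a single pass over the pixels of the block.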

Classification. To handle pseudo-word classification,
we employ an SVM. Although it was initially intended to
separate only the H&V information, we use a multi-class
SVM so that an additional class, i.e., noise, can be
taken into account. Two approaches are basically used for
this: 1) the combination of bi-class SVMs and 2) the
learning of a unique multi-class SVM (MSVM). MSVM is
based on a principle similar to one-vs-all [9], where each
class has its own decision function and the class
corresponding to the function giving the highest value
wins. The difference is that, for an MSVM with Q classes,
the Q functions are learnt at the same time under the same
constraints. A single optimization problem is solved by
maximizing the sum of the margins for each class. There
are four different methods, which differ in how the
penalty is applied. We use the tool presented by Weston
and Watkins [10], which accumulates the penalty with
respect to the margins of each class. The implementation
is carried out on the Weka platform with the SMO
classifier, with the extension of the problem to three
classes by the one-vs-one method as described in Mayoraz
et al. [11].
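The decision rule described above, where each class has its own decision function and the highest value wins, can be sketched in Python with a Gaussian kernel. This is only an illustration of the rule, not of the training procedure or of the Weka implementation: the model layout, the `gamma` value and the toy support vectors are all made up for the example.

```python
import math


def gaussian_kernel(u, v, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, model, gamma):
    """Value of one class's decision function at x.

    `model` is a hypothetical (support, bias) pair, where `support` is
    a list of (vector, coefficient) tuples obtained from training.
    """
    support, bias = model
    return sum(c * gaussian_kernel(sv, x, gamma) for sv, c in support) + bias


def classify(x, models, gamma=0.5):
    """Return the label whose decision function gives the highest value."""
    return max(models, key=lambda label: decision_value(x, models[label], gamma))


# Toy models with made-up support vectors, one per class.
models = {
    "handwritten": ([((0.0, 0.0), 1.0)], 0.0),
    "printed": ([((5.0, 5.0), 1.0)], 0.0),
    "noise": ([((10.0, 0.0), 1.0)], 0.0),
}
label = classify((4.5, 5.2), models)
```

In a real system the descriptor vectors would have many more dimensions and the support vectors, coefficients and biases would come from the learning stage.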

Pseudo-word grouping. This re-grouping method uses
spatial proximity to re-group elementary units. For each
component, the k nearest neighbours are found and the
label of the component is compared with those of its
neighbours. If more than 50% of the neighbours share the
same label, this label is assigned to the central
component.

Generally speaking, since text is written horizontally,
horizontal proximity between components is preferred to
vertical proximity. We therefore define the distance as

$$ d(e_1, e_2) = \sqrt{(x_1 - x_2)^2 \, w_x + (y_1 - y_2)^2 \, w_y} \qquad (1) $$

where $x_i$, $y_i$ are the coordinates of the center of
gravity of CC $e_i$, and $w_x$, $w_y$ are the weights
corresponding to each axis (distance 1). In a similar
manner, a second distance is computed, taken from the
borders of the bounding boxes (distance 2). Based on this
framework, we explain three different algorithms, A1 to
A3, in what follows.

Algorithm 2 Grouping by k-NN with constraints

Require: ∀c ∈ C, old_label(c) ∈ (L)
for all c ∈ C do
    Neighb ← k-nearest-neighbour(k, c, max_dist)
    n = card(Neighb)
    new_label[c] ← old_label[c]
    for all class ∈ (L) do
        N_c ← {x ∈ Neighb, old_label[x] = class}
        if card(N_c) > n/2 then
            if Σ_{x ∈ N_c} area(x) > area(c)/2 then
                new_label[c] ← class
            end if
            break
        end if
    end for
end for
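The weighted distance between centers of gravity defined in Eq. (1) can be written as a short Python function. The weight values used in the example are hypothetical; the paper does not report them. A larger vertical weight makes vertically separated components appear farther apart, which favours horizontal neighbours.

```python
import math


def weighted_distance(p, q, w_x, w_y):
    """Weighted Euclidean distance between two centers of gravity.

    `p` and `q` are (x, y) tuples; `w_x` and `w_y` weight the squared
    horizontal and vertical offsets respectively.
    """
    dx = p[0] - q[0]
    dy = p[1] - q[1]
    return math.sqrt(dx * dx * w_x + dy * dy * w_y)


# Hypothetical weights: vertical offsets count four times as much.
print(weighted_distance((0, 0), (3, 0), w_x=1.0, w_y=4.0))  # 3.0
print(weighted_distance((0, 0), (0, 3), w_x=1.0, w_y=4.0))  # 6.0
```

With these weights, a neighbour 3 pixels to the side is twice as close as a neighbour 3 pixels above or below.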

A1. Grouping by k-NN.

It employs a classical k-NN algorithm whose parameters
are k and a distance threshold, max_dist. The k nearest
neighbours are taken into account only if they are closer
than the pre-defined max_dist. This distance parameter
prevents far away neighbours from interfering with the
component. In our case, max_dist has been fixed to 1) 300
pixels for distance 1, and 2) 100 pixels for distance 2,
with images at 300 dots per inch (dpi). Note that distance
2 is lower than distance 1, and depends on the relative
positioning of the bounding boxes and their sizes. These
thresholds, however, depend on the image resolution.
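A minimal Python sketch of this grouping step is given below, assuming a brute-force neighbour search and a plain Euclidean distance between component centers (the weighted distance or a kd-tree could be substituted). The data layout and the function name are hypothetical. Labels are updated from the old labels only, so the result does not depend on the processing order.

```python
import math
from collections import Counter


def knn_regroup(components, k, max_dist):
    """Relabel components by majority vote among their k nearest neighbours.

    `components` is a list of (x, y, label) tuples. Neighbours farther
    than `max_dist` are ignored. A label changes only when more than
    half of the retained neighbours agree on another label.
    """
    new_labels = []
    for i, (xi, yi, label) in enumerate(components):
        ranked = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj, _) in enumerate(components)
            if j != i
        )
        neighbours = [j for d, j in ranked[:k] if d < max_dist]
        votes = Counter(components[j][2] for j in neighbours)
        if votes:
            winner, count = votes.most_common(1)[0]
            if count > len(neighbours) / 2:
                label = winner
        new_labels.append(label)
    return new_labels


components = [
    (0, 0, "print"), (10, 0, "print"), (20, 0, "hand"),
    (30, 0, "print"), (1000, 0, "hand"),
]
print(knn_regroup(components, k=2, max_dist=300))
# ['print', 'print', 'print', 'print', 'hand']
```

The isolated component at x = 1000 keeps its label because none of its nearest neighbours lies within max_dist.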

A2. Grouping by k-NN with constraints.
The algorithm can be improved by preventing big
components from being corrupted by small ones (such as
noise). Before flipping the label of a component, we test
whether the accumulated number of pixels of the
neighbours contributing to the label change is significant
in comparison to the number of pixels of the tested
component. In our test, this sum should be at least 50%
of the main component. Note that the opposite constraint
does not exist: big components are regrouped with small
ones, which helps gather the main text with small
components such as commas, apostrophes or accents.
Moreover, big components contain more information, so
they are generally more reliable and their classification
is more accurate. An overall idea is presented in
Algorithm 2.
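The constrained vote can be sketched in Python as follows. This is a minimal sketch that assumes a brute-force neighbour search, a plain Euclidean distance and a dictionary layout for the components; all of these, and the function name, are choices made for the example.

```python
import math


def regroup_with_constraints(components, k, max_dist):
    """k-NN relabelling with an area constraint.

    `components` is a list of dicts with keys 'x', 'y', 'area', 'label'.
    A component takes the majority label of its neighbours only if the
    summed area of those neighbours exceeds half of its own area.
    """
    old = [c["label"] for c in components]
    new = list(old)
    for i, c in enumerate(components):
        ranked = sorted(
            (math.hypot(c["x"] - o["x"], c["y"] - o["y"]), j)
            for j, o in enumerate(components)
            if j != i
        )
        neighbours = [j for d, j in ranked[:k] if d < max_dist]
        n = len(neighbours)
        for cls in set(old):
            same = [j for j in neighbours if old[j] == cls]
            if len(same) > n / 2:          # majority among neighbours
                if sum(components[j]["area"] for j in same) > c["area"] / 2:
                    new[i] = cls           # neighbours are big enough
                break                      # at most one class has a majority
    return new


big_text = [
    {"x": 0, "y": 0, "area": 1000, "label": "hand"},
    {"x": 5, "y": 0, "area": 10, "label": "noise"},
    {"x": -5, "y": 0, "area": 10, "label": "noise"},
]
print(regroup_with_constraints(big_text, k=2, max_dist=300))
# ['hand', 'noise', 'noise']
```

Here the large handwritten component keeps its label even though both of its neighbours are noise, because their combined area is far below half of its own.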

A3. Grouping by confidence voting.
The classifier confidence helps to maintain the decision.
Based on the idea of grouping via nearest neighbours with
some specific constraints, we examine the confidence of
the nearest neighbour of a selected pseudo-word. If the
neighbour's confidence is stronger than that of the
pseudo-word, the pseudo-word takes the class of its
neighbour. A Gaussian or polynomial law can weight the
neighbour confidence by its distance to the pseudo-word.
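A possible reading of confidence voting with a Gaussian weighting is sketched below in Python. The weighting function follows the Gaussian form conf · exp(−10⁻³ · dist² / conf²); the negative sign in the exponent, the data layout and the function names are assumptions made for this sketch.

```python
import math


def gauss_weight(conf, dist):
    """Neighbour confidence attenuated by distance (assumed Gaussian form)."""
    return conf * math.exp(-1e-3 * dist ** 2 / conf ** 2)


def confidence_vote(components):
    """Relabel each component from its single nearest neighbour.

    `components` is a list of dicts with keys 'x', 'y', 'label' and
    'conf' (classifier confidence, assumed to be strictly positive).
    A component takes the neighbour's label when the neighbour's
    weighted confidence exceeds its own confidence.
    """
    new_labels = []
    for i, c in enumerate(components):
        label = c["label"]
        others = [
            (math.hypot(c["x"] - o["x"], c["y"] - o["y"]), o)
            for j, o in enumerate(components)
            if j != i
        ]
        if others:
            dist, nearest = min(others, key=lambda pair: pair[0])
            if gauss_weight(nearest["conf"], dist) > c["conf"]:
                label = nearest["label"]
        new_labels.append(label)
    return new_labels


near = [
    {"x": 0, "y": 0, "label": "hand", "conf": 0.4},
    {"x": 5, "y": 0, "label": "print", "conf": 0.9},
]
print(confidence_vote(near))  # ['print', 'print']
```

Because only one neighbour is consulted, a single confident but wrong neighbour is enough to flip a label.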



3 Experiments 

3.1 Dataset and evaluation metric 

Dataset. To perform the tests, we have selected 75
documents for learning and 300 documents for testing.
As a reminder, these samples are taken from a real-world
industrial problem.

Evaluation metric. Our evaluation of H&V separation is
performed according to the measure proposed in [12]. All
test documents have been perfectly labelled at pixel
level, and performance is evaluated in terms of
recognition rate.



$$ \text{Recognition rate} = \frac{\#\ \text{of pixels correctly labelled}}{\#\ \text{of pixels used}} \qquad (2) $$

3.2 Results and analysis



Table 1 shows recognition rates for four grouping
methods. The k-NN uses k = 2. The confidence-based
methods use $f_{gauss}$, $f_{poly2}$ and $f_{poly4}$,
respectively, as weighting functions.



$$ f_{gauss}(conf, dist) = conf \times \exp\left( -\frac{10^{-3} \cdot dist^{2}}{conf^{2}} \right) \qquad (3) $$



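$$ f_{poly2}(conf, dist) = -5 \cdot 10^{-4} \left( [\ldots] \right) + conf \qquad (4) $$

$$ f_{poly4}(conf, dist) = -10 \, [\ldots] \, conf \, [\ldots] \, \frac{dist - 1}{conf} \, [\ldots] \, conf \qquad (5) $$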



Based on the results reported in Table 1, we observe the
following:

1. As expected, the classification by k-NN provides
better results than double smearing alone, i.e.,
segmentation without re-grouping (89.54% against
89.48% on average, and 90.68% for k-NN with
constraints). In contrast, the methods based on
confidence degrade performance. This is mainly due
to the fact that only the local vicinity (a single
neighbour) is taken into account, which makes
misclassification possible.

2. In our study, we have found cases where handwritten
text mixes with printed text, and other cases where
grouping changes the label of isolated handwritten
annotations (e.g., a figure or a symbol). Handling
these situations requires more contextual information
and a better interpretation, which is beyond the
scope of the current work.



Table 1. Evaluation of four grouping methods.

                         Recognition rate (%)
Method                   Hand.   Print.   Noise   Average
Double smearing          96.1    98.5     35.7    89.48
k-NN                     93.4    98.3     27.3    89.54
k-NN with constraints    99.3    99.0     27.9    90.68
Gaussian confidence      94.5    97.7     27.2    87.49
Poly confidence 2 & 4    93.5    97.7     14.2    86.06



On the whole, for visual understanding, we provide a few
examples of H&V text separation in Fig. 4. Furthermore,
Fig. 5 shows a comparison between four classifiers: SVM,
Tree C4.5 (J48 implementation), REPTree and NN. In this
comparison, we have found that SVM performs the best,
with only a marginal difference from NN. This means that
MLP can still be applied.

Figure 4. A few examples of H&V text separation,
illustrating the robustness of the proposed approach.

Figure 5. Evaluation of four classifiers.

4 Conclusion 

In this paper, we have presented an approach to separate
handwritten and machine-printed text, as well as noise,
in a scanned document. The method is based on a double
smearing technique to obtain the pseudo-words, which
serve as a basis for classification. For these
pseudo-words, descriptors are extracted, all of which
have a complexity linear in the number of pixels. The
descriptors are then fed into a multi-class SVM with a
Gaussian kernel, which provides the first label of each
pseudo-word. A second analysis is carried out by studying
the local vicinity of each pseudo-word, whose label can
change if the neighbours are from another class. This
integration of context allows several possible errors to
be corrected. In our tests, we have found that the best
method is k-NN with constraints, where a kd-tree has been
used.

Considering our small learning database, the results are
fairly encouraging and suggest that an appropriate
commercial application is feasible. Based on our reported
results, a long-term approach to incremental learning is
one of the further issues.